CS 188 Fall 2012 Introduction to Artificial Intelligence Midterm II

- You have approximately 3 hours.
- The exam is closed book, closed notes except a one-page crib sheet.
- Please use non-programmable calculators only.
- Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. All short answer sections can be successfully answered in a few sentences AT MOST.

First name
Last name
SID
EdX username
First and last name of student to your left
First and last name of student to your right

For staff use only:
Q1. December 21, 2012                       /10
Q2. Bayes Nets Representation               /16
Q3. Variable Elimination                    /13
Q4. Bayes Nets Sampling                     /10
Q5. Probability and Decision Networks       /15
Q6. Election                                /12
Q7. Naïve Bayes Modeling Assumptions        /6
Q8. Model Structure and Laplace Smoothing   /7
Q9. ML: Short Question & Answer             /11
Total                                       /100

THIS PAGE IS INTENTIONALLY LEFT BLANK

Q1. [10 pts] December 21, 2012

A smell of sulphur (S) can be caused either by rotten eggs (E) or as a sign of the doom brought by the Mayan Apocalypse (M). The Mayan Apocalypse also causes the oceans to boil (B). The Bayesian network and corresponding conditional probability tables for this situation are shown below. For each part, you should give either a numerical answer (e.g. 0.81) or an arithmetic expression in terms of numbers from the tables below (e.g. 0.9 · 0.9). Note: be careful of doing unnecessary computation here.

[Bayes net: E -> S <- M, M -> B]

P(E): +e 0.4; -e 0.6

P(M): +m 0.1; -m 0.9

P(S | E, M):
+e +m +s 1.0
+e +m -s 0.0
+e -m +s 0.8
+e -m -s 0.2
-e +m +s 0.3
-e +m -s 0.7
-e -m +s 0.1
-e -m -s 0.9

P(B | M):
+m +b 1.0
+m -b 0.0
-m +b 0.1
-m -b 0.9

(a) [2 pts] Compute the following entry from the joint distribution: P(-e, -s, -m, -b) =

(b) [2 pts] What is the probability that the oceans boil? P(+b) =

(c) [2 pts] What is the probability that the Mayan Apocalypse is occurring, given that the oceans are boiling? P(+m | +b) =

(The figures and tables for this question are identical to the ones on the previous page.)

(d) [2 pts] What is the probability that the Mayan Apocalypse is occurring, given that there is a smell of sulphur, the oceans are boiling, and there are rotten eggs? P(+m | +s, +b, +e) =

(e) [2 pts] What is the probability that rotten eggs are present, given that the Mayan Apocalypse is occurring? P(+e | +m) =
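For working through queries like these, the sketch below enumerates the full joint from the CPTs above; the dictionary encoding, variable indexing, and function names are illustrative choices, not anything the exam prescribes.

```python
from itertools import product

# CPTs from Q1, encoded as dictionaries (an illustrative encoding, not the exam's).
P_E = {'+e': 0.4, '-e': 0.6}
P_M = {'+m': 0.1, '-m': 0.9}
P_S = {('+e', '+m'): 1.0, ('+e', '-m'): 0.8,
       ('-e', '+m'): 0.3, ('-e', '-m'): 0.1}      # P(+s | e, m); P(-s | e, m) = 1 - this
P_B = {'+m': 1.0, '-m': 0.1}                      # P(+b | m); P(-b | m) = 1 - this

def joint(e, s, m, b):
    """P(e, s, m, b) via the factorization implied by the network E -> S <- M, M -> B."""
    p_s = P_S[(e, m)] if s == '+s' else 1 - P_S[(e, m)]
    p_b = P_B[m] if b == '+b' else 1 - P_B[m]
    return P_E[e] * P_M[m] * p_s * p_b

def posterior(index, value, evidence):
    """P(variable-at-index = value | evidence) by enumerating the full joint.
    Variables are indexed in the order (0=E, 1=S, 2=M, 3=B)."""
    num = den = 0.0
    for assignment in product(['+e', '-e'], ['+s', '-s'], ['+m', '-m'], ['+b', '-b']):
        if any(assignment[i] != v for i, v in evidence.items()):
            continue
        p = joint(*assignment)
        den += p
        if assignment[index] == value:
            num += p
    return num / den

# Example usage in the style of part (c): P(+m | +b).
print(posterior(2, '+m', {3: '+b'}))
```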

Q2. [16 pts] Bayes Nets Representation

(a) [6 pts] Graph Structure: Conditional Independence

Consider the Bayes net given below.

[Bayes net diagram over the nodes A, B, C, D, E, F, G, H]

Remember that X ⊥ Y reads as "X is independent of Y given nothing", and X ⊥ Y | {Z, W} reads as "X is independent of Y given Z and W". For each expression, fill in the corresponding circle to indicate whether it is True or False.

(i) True False    It is guaranteed that A ⊥ B
(ii) True False    It is guaranteed that A ⊥ C
(iii) True False    It is guaranteed that A ⊥ D | {B, H}
(iv) True False    It is guaranteed that A ⊥ E | F
(v) True False    It is guaranteed that G ⊥ E
(vi) True False    It is guaranteed that F ⊥ C | D
(vii) True False    It is guaranteed that E ⊥ D
(viii) True False    It is guaranteed that C ⊥ H | G
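Independence claims like the ones above are decided by d-separation. Below is a small self-contained sketch of the ancestral-moral-graph test, under the assumption that the graph is given as a list of directed edges; the three-node graph in the example is hypothetical, since the exam's figure is not reproduced here.

```python
from itertools import combinations

def d_separated(edges, xs, ys, zs):
    """Test whether xs is d-separated from ys given zs in the DAG described by
    `edges` (a list of (parent, child) pairs), using the ancestral-moral-graph
    criterion: keep only ancestors of xs, ys, zs; marry co-parents and drop edge
    directions; delete zs; then xs and ys are d-separated iff they are disconnected."""
    parents = {}
    for u, v in edges:
        parents.setdefault(u, set())
        parents.setdefault(v, set()).add(u)
    # 1. Ancestral closure of xs, ys, zs.
    relevant = set(xs) | set(ys) | set(zs)
    frontier = list(relevant)
    while frontier:
        n = frontier.pop()
        for p in parents.get(n, set()):
            if p not in relevant:
                relevant.add(p)
                frontier.append(p)
    # 2. Moralize the ancestral subgraph.
    neighbors = {n: set() for n in relevant}
    for v in relevant:
        ps = parents.get(v, set()) & relevant
        for p in ps:
            neighbors[p].add(v)
            neighbors[v].add(p)
        for p, q in combinations(ps, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)
    # 3. Remove zs and look for any path from xs to ys.
    blocked, seen = set(zs), set()
    stack = [x for x in xs if x not in blocked]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if n in ys:
            return False
        stack.extend(m for m in neighbors[n] - blocked if m not in seen)
    return True

# Hypothetical example: the collider A -> C <- B with C -> D.
edges = [('A', 'C'), ('B', 'C'), ('C', 'D')]
print(d_separated(edges, {'A'}, {'B'}, set()))   # True: the collider at C blocks the path
print(d_separated(edges, {'A'}, {'B'}, {'D'}))   # False: observing a descendant of C activates it
```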

(b) Graph Structure: Representational Power

Recall that any directed acyclic graph G has an associated family of probability distributions, which consists of all probability distributions that can be represented by a Bayes net with structure G. For the following questions, consider six directed acyclic graphs G1, G2, G3, G4, G5, G6 over the nodes A, B, C:

[Six three-node Bayes net structures G1 through G6 over the nodes A, B, C]

(i) [2 pts] Assume all we know about the joint distribution P(A, B, C) is that it can be represented by the product P(A | B, C) P(B | C) P(C). Mark each graph for which the associated family of probability distributions is guaranteed to include P(A, B, C).

G1    G2    G3    G4    G5    G6

(ii) [2 pts] Now assume all we know about the joint distribution P(A, B, C) is that it can be represented by the product P(C | B) P(B | A) P(A). Mark each graph for which the associated family of probability distributions is guaranteed to include P(A, B, C).

G1    G2    G3    G4    G5    G6

(c) Marginalization and Conditioning

Consider a Bayes net over the random variables A, B, C, D, E with the structure shown below, with full joint distribution P(A, B, C, D, E). The following three questions describe different, unrelated situations (your answers to one question should not influence your answer to other questions).

[Bayes net diagram over the nodes A, B, C, D, E]

(i) [2 pts] Consider the marginal distribution P(A, B, D, E) = Σ_c P(A, B, c, D, E), where C was eliminated. On the diagram below, draw the minimal number of arrows that results in a Bayes net structure that is able to represent this marginal distribution. If no arrows are needed, write "No arrows needed."

[Diagram with the unconnected nodes A, B, D, E]

(ii) [2 pts] Assume we are given an observation: A = a. On the diagram below, draw the minimal number of arrows that results in a Bayes net structure that is able to represent the conditional distribution P(B, C, D, E | A = a). If no arrows are needed, write "No arrows needed."

[Diagram with the unconnected nodes B, C, D, E]

(iii) [2 pts] Assume we are given two observations: D = d, E = e. On the diagram below, draw the minimal number of arrows that results in a Bayes net structure that is able to represent the conditional distribution P(A, B, C | D = d, E = e). If no arrows are needed, write "No arrows needed."

[Diagram with the unconnected nodes A, B, C]

Q3. [13 pts] Variable Elimination

For the Bayes net shown below, we are given the query P(B, D | +f). All variables have binary domains. Assume we run variable elimination to compute the answer to this query, with the following variable elimination ordering: A, C, E, G.

[Bayes net diagram with edges A -> B, B -> C, C -> D, C -> E, D -> E, C -> F, E -> F, C -> G, F -> G]

(a) Complete the following description of the factors generated in this process:

After inserting evidence, we have the following factors to start out with:

P(A), P(B | A), P(C | B), P(D | C), P(E | C, D), P(+f | C, E), P(G | C, +f)

When eliminating A we generate a new factor f1 as follows:

f1(B) = Σ_a P(a) P(B | a)

This leaves us with the factors:

P(C | B), P(D | C), P(E | C, D), P(+f | C, E), P(G | C, +f), f1(B)

(i) [2 pts] When eliminating C we generate a new factor f2 as follows:

This leaves us with the factors:

(ii) [2 pts] When eliminating E we generate a new factor f3 as follows:

This leaves us with the factors:

(iii) [2 pts] When eliminating G we generate a new factor f4 as follows:

This leaves us with the factors:

(b) [2 pts] Explain in one sentence how P(B, D | +f) can be computed from the factors left in part (iii) of (a).

(c) [1 pt] Among f1, f2, ..., f4, which is the largest factor generated, and how large is it? Assume all variables have binary domains and measure the size of each factor by the number of rows in the table that would represent the factor.

For your convenience, the Bayes net from the previous page is shown again below.

[Bayes net diagram with edges A -> B, B -> C, C -> D, C -> E, D -> E, C -> F, E -> F, C -> G, F -> G, repeated from the previous page]

(d) [4 pts] Find a variable elimination ordering for the same query, i.e., for P(B, D | +f), for which the maximum size factor generated along the way is smallest. Hint: the maximum size factor generated in your solution should have only 2 variables, for a table size of 2^2 = 4. Fill in the variable elimination ordering and the factors generated into the table below.

Variable Eliminated    Factor Generated

For example, in the naive ordering we used earlier, the first line in this table would have had the following two entries: A, f1(B). For this question there is no need to include how each factor is computed, i.e., no need to include expressions of the type f1(B) = Σ_a P(a) P(B | a).
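For reference, a minimal sketch of the two factor operations that variable elimination repeats (join all factors mentioning a variable, then sum that variable out). The dictionary-based factor representation is an illustrative choice, and the two-factor example at the end is hypothetical rather than the network from this question.

```python
from itertools import product

def multiply(f1, f2):
    """Join two factors. A factor is a (variables, table) pair where `table` maps
    assignment tuples (ordered like `variables`, values True/False) to numbers."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = list(vars1) + [v for v in vars2 if v not in vars1]
    table = {}
    for assignment in product([True, False], repeat=len(out_vars)):
        row = dict(zip(out_vars, assignment))
        table[assignment] = (t1[tuple(row[v] for v in vars1)] *
                             t2[tuple(row[v] for v in vars2)])
    return out_vars, table

def sum_out(var, factor):
    """Eliminate `var` from a factor by summing it out."""
    variables, table = factor
    idx = variables.index(var)
    keep = [v for i, v in enumerate(variables) if i != idx]
    out = {}
    for assignment, value in table.items():
        key = tuple(a for i, a in enumerate(assignment) if i != idx)
        out[key] = out.get(key, 0.0) + value
    return keep, out

# Hypothetical example: eliminating A from P(A) * P(B | A), analogous to f1(B) above.
P_A = (['A'], {(True,): 0.4, (False,): 0.6})
P_B_given_A = (['A', 'B'], {(True, True): 0.9, (True, False): 0.1,
                            (False, True): 0.2, (False, False): 0.8})
f1 = sum_out('A', multiply(P_A, P_B_given_A))
print(f1)   # a factor over B only
```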

Q4. [10 pts] Bayes Nets Sampling

Assume the following Bayes net, and the corresponding distributions over the variables in the Bayes net:

[Bayes net: A -> B -> C -> D]

P(A): -a 3/4; +a 1/4
P(B | A): -a -b 2/3; -a +b 1/3; +a -b 4/5; +a +b 1/5
P(C | B): -b -c 1/4; -b +c 3/4; +b -c 1/2; +b +c 1/2
P(D | C): -c -d 1/8; -c +d 7/8; +c -d 5/6; +c +d 1/6

(a) You are given the following samples:

+a +b -c -d
+a -b +c -d
-a +b +c -d
-a -b +c -d
+a -b -c +d
+a +b +c -d
-a +b -c +d
-a -b +c -d

(i) [1 pt] Assume that these samples came from performing Prior Sampling, and calculate the sample estimate of P(+c).

(ii) [2 pts] Now we will estimate P(+c | +a, -d). Above, clearly cross out the samples that would not be used when doing Rejection Sampling for this task, and write down the sample estimate of P(+c | +a, -d) below.

(b) [2 pts] Using Likelihood Weighting Sampling to estimate P(-a | +b, -d), the following samples were obtained. Fill in the weight of each sample in the corresponding row.

Sample          Weight
-a +b +c -d
+a +b +c -d
+a +b -c -d
-a +b -c -d

(c) [1 pt] From the weighted samples in the previous question, estimate P(-a | +b, -d).

(d) [2 pts] Which query is better suited for likelihood weighting, P(D | A) or P(A | D)? Justify your answer in one sentence.
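For reference, a minimal sketch of likelihood weighting on the chain A -> B -> C -> D with the CPTs above: evidence variables are clamped to their observed values, and each sample's weight is the product of the evidence CPT entries given the already-sampled parents. The encoding and function names are illustrative choices.

```python
import random

# CPTs for the chain A -> B -> C -> D from this question, stored as
# P(child is "+" | parent value); parent value True means "+".
P_A_TRUE = 1/4
P_B_TRUE = {True: 1/5, False: 1/3}   # P(+b | a)
P_C_TRUE = {True: 1/2, False: 3/4}   # P(+c | b)
P_D_TRUE = {True: 1/6, False: 7/8}   # P(+d | c)

def bernoulli_prob(p_true, value):
    return p_true if value else 1 - p_true

def weighted_sample(evidence):
    """One likelihood-weighted sample: non-evidence variables are sampled from
    their CPTs given already-sampled parents; evidence variables are clamped and
    their CPT probabilities are multiplied into the weight."""
    sample, weight = {}, 1.0
    order = [('A', lambda s: P_A_TRUE),
             ('B', lambda s: P_B_TRUE[s['A']]),
             ('C', lambda s: P_C_TRUE[s['B']]),
             ('D', lambda s: P_D_TRUE[s['C']])]
    for name, p_true_of in order:
        p_true = p_true_of(sample)
        if name in evidence:
            sample[name] = evidence[name]
            weight *= bernoulli_prob(p_true, evidence[name])
        else:
            sample[name] = random.random() < p_true
    return sample, weight

def lw_estimate(query_var, query_value, evidence, n=100_000):
    """Estimate P(query_var = query_value | evidence) from n weighted samples."""
    num = den = 0.0
    for _ in range(n):
        sample, w = weighted_sample(evidence)
        den += w
        if sample[query_var] == query_value:
            num += w
    return num / den

# Example usage in the style of parts (b)-(c): estimate P(-a | +b, -d).
print(lw_estimate('A', False, {'B': True, 'D': False}))
```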

(e) [2 pts] Recall that during Gibbs Sampling, samples are generated through an iterative process. Assume that the only evidence that is available is A = +a. Clearly fill in the circle(s) of the sequence(s) below that could have been generated by Gibbs Sampling.

Sequence 1
1: +a -b -c +d
2: +a -b -c +d
3: +a -b +c +d

Sequence 2
1: +a -b -c +d
2: +a -b -c -d
3: -a -b -c +d

Sequence 3
1: +a -b -c +d
2: +a -b -c -d
3: +a +b -c -d

Sequence 4
1: +a -b -c +d
2: +a -b -c -d
3: +a +b -c +d

Q5. [15 pts] Probability and Decision Networks

The new Josh Bond Movie (M), Skyrise, is premiering later this week. Skyrise will either be great (+m) or horrendous (-m); there are no other possible outcomes for its quality. Since you are going to watch the movie no matter what, your primary choice is between going to the theater (theater) or renting (rent) the movie later. Your utility of enjoyment is only affected by these two variables as shown below:

P(M): +m 0.5; -m 0.5

U(M, A):
+m theater 100
-m theater 10
+m rent 80
-m rent 40

(a) [3 pts] Maximum Expected Utility

Compute the following quantities:

EU(theater) =

EU(rent) =

MEU({}) =

Which action achieves MEU({})?

(b) [3 pts] Fish and Chips

Skyrise is being released two weeks earlier in the U.K. than the U.S., which gives you the perfect opportunity to predict the movie's quality. Unfortunately, you don't have access to many sources of information in the U.K., so a little creativity is in order. You realize that a reasonable assumption to make is that if the movie (M) is great, citizens in the U.K. will celebrate by eating fish and chips (F). Unfortunately the consumption of fish and chips is also affected by a possible food shortage (S), as denoted in the diagram below.

[Decision network: S and M are parents of F; M and the action A feed the utility node]

The consumption of fish and chips (F) and the food shortage (S) are both binary variables. The relevant conditional probability tables are listed below:

P(F | S, M):
+s +m +f 0.6
+s +m -f 0.4
+s -m +f 0.0
+s -m -f 1.0
-s +m +f 1.0
-s +m -f 0.0
-s -m +f 0.3
-s -m -f 0.7

P(S): +s 0.2; -s 0.8

You are interested in the value of revealing the food shortage node (S). Answer the following queries:

EU(theater | +s) =

EU(rent | +s) =

MEU({+s}) =

Optimal action under {+s} =

MEU({-s}) =

Optimal action under {-s} =

VPI(S) =
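For reference, a minimal sketch of the expected utility, MEU, and VPI computations this question exercises, using the prior and utility table from part (a); the helper names and the way posteriors over M are passed in are illustrative choices, not the exam's required method.

```python
# Prior and utilities from part (a); the helper names are illustrative.
P_M = {'+m': 0.5, '-m': 0.5}
U = {('+m', 'theater'): 100, ('-m', 'theater'): 10,
     ('+m', 'rent'): 80, ('-m', 'rent'): 40}
ACTIONS = ['theater', 'rent']

def expected_utility(action, posterior_m):
    """EU(action | evidence), where posterior_m is P(M | evidence)."""
    return sum(posterior_m[m] * U[(m, action)] for m in posterior_m)

def meu(posterior_m):
    """Return (MEU, best action) for the given posterior over M."""
    best = max(ACTIONS, key=lambda a: expected_utility(a, posterior_m))
    return expected_utility(best, posterior_m), best

def vpi(evidence_cases, prior_m):
    """VPI(E) = sum_e P(e) * MEU({e}) - MEU({}), where evidence_cases is a list of
    (P(e), P(M | e)) pairs covering every value e of the evidence variable."""
    expected_meu = sum(p_e * meu(post_m)[0] for p_e, post_m in evidence_cases)
    return expected_meu - meu(prior_m)[0]

# Example usage: in part (b) the network gives S no direct connection to M, so
# P(M | s) = P(M) for both values of S, and the VPI computed this way comes out 0.
print(meu(P_M))
print(vpi([(0.2, P_M), (0.8, P_M)], P_M))
```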

(c) [5 pts] Greasy Waters

You are no longer concerned with the food shortage variable. Instead, you realize that you can determine whether the runoff waters are greasy (G) in the U.K., which is a variable that indicates whether or not fish and chips have been consumed. The prior on M and utility tables are unchanged. Given this different model of the problem:

[Decision network]

[Tables that define the model]

P(G | F): +g +f 0.8; -g +f 0.2; +g -f 0.3; -g -f 0.7
P(F | M): +f +m 0.92; -f +m 0.08; +f -m 0.24; -f -m 0.76
P(M): +m 0.5; -m 0.5
U(M, A): +m theater 100; -m theater 10; +m rent 80; -m rent 40

[Tables computed from the first set of tables. Some of them might be convenient to answer the questions below]

P(F): +f 0.58; -f 0.42
P(G): +g 0.59; -g 0.41
P(M | G): +m +g 0.644; -m +g 0.356; +m -g 0.293; -m -g 0.707
P(G | M): +g +m 0.760; -g +m 0.240; +g -m 0.420; -g -m 0.580
P(M | F): +m +f 0.793; -m +f 0.207; +m -f 0.095; -m -f 0.905

Answer the following queries:

MEU(+g) =

MEU(-g) =

VPI(G) =

(d) VPI Comparisons

We consider the shortage variable (S) again, resulting in the decision network shown below. The (conditional) probability tables for P(S), P(M), P(F | S, M) and P(G | F) are the ones provided above. The utility function is still the one shown in part (a). Circle all statements that are true, and provide a brief justification (no credit without justification).

[Decision network with chance nodes S, M, F, G, the action node, and the utility node]

(i) [1 pt] VPI(S):
VPI(S) < 0    VPI(S) = 0    VPI(S) > 0    VPI(S) = VPI(F)    VPI(S) = VPI(G)
Justify:

(ii) [1 pt] VPI(S | G):
VPI(S | G) < 0    VPI(S | G) = 0    VPI(S | G) > 0    VPI(S | G) = VPI(F)    VPI(S | G) = VPI(G)
Justify:

(iii) [1 pt] VPI(G | F):
VPI(G | F) < 0    VPI(G | F) = 0    VPI(G | F) > 0    VPI(G | F) = VPI(F)    VPI(G | F) = VPI(G)
Justify:

(iv) [1 pt] VPI(G):
VPI(G) = 0    VPI(G) > 0    VPI(G) > VPI(F)    VPI(G) < VPI(F)    VPI(G) = VPI(F)
Justify:

Q6. [12 pts] Election

The country of Purplestan is preparing to vote on its next President! In this election, the incumbent President Purple is being challenged by the ambitious upstart Governor Fuschia. Purplestan is divided into two states of equal population, Redexas and Blue York, and the Blue York Times has recruited you to help track the election.

[Dynamic Bayes net diagram: R_t and B_t evolve to R_{t+1} and B_{t+1}; each week the surveys S_t^R, S_t^B, and S_t^N are observed]

Drift and Error Models:

x     D(x)   E^R(x)   E^B(x)   E^N(x)
 5    .01    .00      .04      .00
 4    .03    .01      .06      .00
 3    .07    .04      .09      .01
 2    .12    .12      .11      .05
 1    .17    .18      .13      .24
 0    .20    .30      .14      .40
-1    .17    .18      .13      .24
-2    .12    .12      .11      .05
-3    .07    .04      .09      .01
-4    .03    .01      .06      .00
-5    .01    .00      .04      .00

To begin, you draw the dynamic Bayes net given above, which includes the President's true support in Redexas and Blue York (denoted R_t and B_t respectively) as well as weekly survey results. Every week there is a survey of each state, S_t^R and S_t^B, and also a national survey S_t^N whose sample includes equal representation from both states. The model's transition probabilities are given in terms of the random drift model D(x) specified in the table above:

P(R_{t+1} | R_t) = D(R_{t+1} - R_t)
P(B_{t+1} | B_t) = D(B_{t+1} - B_t)

Here D(x) gives the probability that the support in each state shifts by x between one week and the next. Similarly, the observation probabilities are defined in terms of error models E^R(x), E^B(x), and E^N(x):

P(S_t^R | R_t) = E^R(S_t^R - R_t)
P(S_t^B | B_t) = E^B(S_t^B - B_t)
P(S_t^N | R_t, B_t) = E^N(S_t^N - (R_t + B_t)/2)

where the error model for each survey gives the probability that it differs by x from the true support; the different error models represent the surveys' differing polling methodologies and sample sizes. Note that S_t^N depends on both R_t and B_t, since the national survey gives a noisy average of the President's support across both states.

(a) Particle Filtering. First we'll consider using particle filtering to track the state of the electorate. Throughout this problem, you may give answers either as unevaluated numeric expressions (e.g. 0.4 · 0.9) or as numeric values (e.g. 0.36).

(i) [2 pts] Suppose we begin in week 1 with the two particles listed below. Now we observe the first week's surveys: S_1^R = 51, S_1^B = 45, and S_1^N = 50. Write the weight of each particle given this evidence:

Particle            Weight
(r = 49, b = 47)
(r = 52, b = 48)

(The dynamic Bayes net figure and the drift and error model table for this question are identical to the ones on the previous page.)

(ii) [2 pts] Now we resample the particles based on their weights; suppose our resulting particle set turns out to be {(r = 52, b = 48), (r = 52, b = 48)}. Now we pass the first particle through the transition model to produce a hypothesis for week 2. What's the probability that the first particle becomes (r = 50, b = 48)?

(iii) [2 pts] In week 2, disaster strikes! A hurricane knocks out the offices of the company performing the Blue York state survey, so you can only observe S_2^R = 48 and S_2^N = 50 (the national survey still incorporates data from voters in Blue York). Based on these observations, compute the weight for each of the two particles:

Particle            Weight
(r = 50, b = 48)
(r = 49, b = 53)

(iv) [4 pts] Your editor at the Times asks you for a now-cast prediction of the election if it were held today. The election directly measures the true support in both states, so R_2 would be the election result in Redexas and B_2 the result in Blue York. To simplify notation, let I_2 = (S_1^R, S_1^B, S_1^N, S_2^R, S_2^N) denote all of the information you observed in weeks 1 and 2, and also let the variable W_i indicate whether President Purple would win an election in week i:

W_i = 1 if (R_i + B_i)/2 > 50, and 0 otherwise.

For improved accuracy we will work with the weighted particles rather than resampling. Normally we would build on top of step (iii), but to decouple errors, let's assume that after step (iii) you ended up with the following weights:

Particle            Weight
(r = 50, b = 48)    .12
(r = 49, b = 53)    .18

Note this is not actually what you were supposed to end up with! Using the weights from this table, estimate the following quantities:

- The current probability that the President would win: P(W_2 = 1 | I_2)
- Expected support for President Purple in Blue York: E[B_2 | I_2]

(v) [2 pts] The real election is being held next week (week 3). Suppose you are representing the current joint belief distribution P(R_2, B_2 | I_2) with a large number of unweighted particles. Explain using no more than two sentences how you would use these particles to forecast the national election (i.e. how you would estimate P(W_3 = 1 | I_2), the probability that the President wins in week 3, given your observations from weeks 1 and 2).
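For reference, a minimal sketch of the particle filtering steps this question walks through (weight particles by the observation model, resample in proportion to weight, then push particles through the transition model). The dictionary encodings of D, E^R, and E^N, and the rounding of the national-survey average to the nearest integer bucket, are assumptions made for this sketch.

```python
import random

# Assumed dictionary encodings of the drift model D and the two error models used
# in part (iii); entries mirror the table above (x from -5 to 5).
D   = {-5: .01, -4: .03, -3: .07, -2: .12, -1: .17, 0: .20,
        1: .17,  2: .12,  3: .07,  4: .03,  5: .01}
E_R = {-5: .00, -4: .01, -3: .04, -2: .12, -1: .18, 0: .30,
        1: .18,  2: .12,  3: .04,  4: .01,  5: .00}
E_N = {-5: .00, -4: .00, -3: .01, -2: .05, -1: .24, 0: .40,
        1: .24,  2: .05,  3: .01,  4: .00,  5: .00}

def weight(particle, s_r=None, s_n=None):
    """Observation weight of a particle (r, b): the product of the error-model
    probabilities of whichever surveys were actually observed that week."""
    r, b = particle
    w = 1.0
    if s_r is not None:
        w *= E_R.get(s_r - r, 0.0)
    if s_n is not None:
        # The national survey is compared against the two-state average; rounding
        # to the nearest integer bucket is an assumption of this sketch.
        w *= E_N.get(round(s_n - (r + b) / 2), 0.0)
    return w

def resample(particles, weights, n):
    """Draw n particles with probability proportional to their weights."""
    return random.choices(particles, weights=weights, k=n)

def transition(particle):
    """Sample an independent drift from D for each state."""
    r, b = particle
    shifts, probs = zip(*sorted(D.items()))
    return (r + random.choices(shifts, weights=probs, k=1)[0],
            b + random.choices(shifts, weights=probs, k=1)[0])

# Example usage in the style of part (iii): only S_2^R = 48 and S_2^N = 50 observed.
particles = [(50, 48), (49, 53)]
print([weight(p, s_r=48, s_n=50) for p in particles])
```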

Q7. [6 pts] Naïve Bayes Modeling Assumptions

You are given points from 2 classes, shown as rectangles and dots. For each of the following sets of points, mark whether they satisfy all the Naïve Bayes modeling assumptions or they do not satisfy all the Naïve Bayes modeling assumptions. Note that in (c), 4 rectangles overlap with 4 dots.

[Six scatter plots (a) through (f) of the two classes over features f1 and f2]

(a) Satisfies    Does not Satisfy
(b) Satisfies    Does not Satisfy
(c) Satisfies    Does not Satisfy
(d) Satisfies    Does not Satisfy
(e) Satisfies    Does not Satisfy
(f) Satisfies    Does not Satisfy

Q8. [7 pts] Model Structure and Laplace Smoothing

We are estimating parameters for a Bayes net with structure G_A and for a Bayes net with structure G_B. To estimate the parameters we use Laplace smoothing with k = 0 (which is the same as maximum likelihood), k = 5, and k = ∞.

[Diagrams of the two Bayes net structures G_A and G_B]

For a given Bayes net N, let the corresponding joint distribution over all variables in the Bayes net be P_N; then the likelihood of the training data for the Bayes net N is given by

Π_{x_i in Training Set} P_N(x_i)

Let L_A^0 denote the likelihood of the training data for the Bayes net with structure G_A and parameters learned with Laplace smoothing with k = 0. Let L_A^5 denote the likelihood of the training data for the Bayes net with structure G_A and parameters learned with Laplace smoothing with k = 5. Let L_A^∞ denote the likelihood of the training data for the Bayes net with structure G_A and parameters learned with Laplace smoothing with k = ∞. We similarly define L_B^0, L_B^5, L_B^∞ for structure G_B. For each of the questions below, mark which one is the correct option.

(a) [1 pt] Consider L_A^0 and L_A^5:
L_A^0 ≤ L_A^5    L_A^0 ≥ L_A^5    L_A^0 = L_A^5    Insufficient information to determine the ordering.

(b) [1 pt] Consider L_A^5 and L_A^∞:
L_A^5 ≤ L_A^∞    L_A^5 ≥ L_A^∞    L_A^5 = L_A^∞    Insufficient information to determine the ordering.

(c) [1 pt] Consider L_B^0 and L_B^∞:
L_B^0 ≤ L_B^∞    L_B^0 ≥ L_B^∞    L_B^0 = L_B^∞    Insufficient information to determine the ordering.

(d) [1 pt] Consider L_A^0 and L_B^0:
L_A^0 ≤ L_B^0    L_A^0 ≥ L_B^0    L_A^0 = L_B^0    Insufficient information to determine the ordering.

(e) [1 pt] Consider L_A^∞ and L_B^∞:
L_A^∞ ≤ L_B^∞    L_A^∞ ≥ L_B^∞    L_A^∞ = L_B^∞    Insufficient information to determine the ordering.

(f) [1 pt] Consider L_A^5 and L_B^0:
L_A^5 ≤ L_B^0    L_A^5 ≥ L_B^0    L_A^5 = L_B^0    Insufficient information to determine the ordering.

(g) [1 pt] Consider L_A^0 and L_B^5:
L_A^0 ≤ L_B^5    L_A^0 ≥ L_B^5    L_A^0 = L_B^5    Insufficient information to determine the ordering.
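For reference, a minimal sketch of Laplace-smoothed estimation of a single conditional table, the kind of parameter this question reasons about: every (parent value, outcome) cell receives k pseudo-counts, so k = 0 reduces to maximum likelihood and large k pulls every row toward uniform. The data and names below are hypothetical.

```python
from collections import Counter

def laplace_cpt(samples, x_values, k):
    """Estimate P(x | parent) from (parent, x) pairs with Laplace smoothing
    strength k: every (parent, x) cell gets k pseudo-counts, so k = 0 is the
    maximum-likelihood estimate and large k pulls every row toward uniform."""
    joint_counts = Counter(samples)
    parent_counts = Counter(parent for parent, _ in samples)
    cpt = {}
    for parent in parent_counts:
        denom = parent_counts[parent] + k * len(x_values)
        for x in x_values:
            cpt[(parent, x)] = (joint_counts[(parent, x)] + k) / denom
    return cpt

# Hypothetical training set of (parent value, child value) observations.
data = [('+a', '+b'), ('+a', '+b'), ('+a', '-b'), ('-a', '-b')]
print(laplace_cpt(data, ['+b', '-b'], k=0))   # maximum likelihood
print(laplace_cpt(data, ['+b', '-b'], k=5))   # heavily smoothed toward 1/2
```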

Q9. [11 pts] ML: Short Question & Answer

(a) Parameter Estimation and Smoothing. For the Bayes net drawn on the left, A can take on values +a, -a, and B can take values +b and -b. We are given samples (on the right), and we want to use them to estimate P(A) and P(B | A).

[Bayes net: A -> B]

Samples:
(-a, +b)  (-a, +b)  (-a, -b)  (-a, -b)  (-a, -b)
(-a, -b)  (-a, +b)  (-a, +b)  (-a, -b)  (+a, +b)

(i) [3 pts] Compute the maximum likelihood estimates for P(A) and P(B | A), and fill them in the 2 tables on the right.

A    P(A)
+a
-a

A    B    P(B | A)
+a   +b
+a   -b
-a   +b
-a   -b

(ii) [3 pts] Compute the estimates for P(A) and P(B | A) using Laplace smoothing with strength k = 2, and fill them in the 2 tables on the right.

A    P(A)
+a
-a

A    B    P(B | A)
+a   +b
+a   -b
-a   +b
-a   -b

(b) [2 pts] Linear Separability. You are given samples from 2 classes (. and +) with each sample being described by 2 features f1 and f2. These samples are plotted in the following figure. You observe that these samples are not linearly separable using just these 2 features. Circle the minimal set of features below that you could use alongside f1 and f2 to linearly separate samples from the 2 classes.

[Scatter plot of the samples over features f1 and f2]

f1 < 1.5        f2 < 1
f1 > 1.5        f2 > 1
f1 + f2
Even using all these features alongside f1 and f2 will not make the samples linearly separable.

(c) [3 pts] Perceptrons. In this question you will perform perceptron updates. You have 2 classes, +1 and -1, and 3 features f0, f1, f2 for each training point. The +1 class is predicted if w · f > 0 and the -1 class is predicted otherwise. You start with the weight vector w = [1 0 0]. In the table below, do a perceptron update for each of the given samples. If the w vector does not change, write "No Change"; otherwise write down the new w vector.

f0   f1   f2   Class   Updated w
1    7    8    -1
1    6    8    -1
1    9    6    +1
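For reference, a minimal sketch of the binary perceptron update used in part (c): predict +1 when w · f > 0, and on a mistake add or subtract the feature vector according to the true class. The loop mirrors the three rows of the table above; the function name is an illustrative choice.

```python
def perceptron_update(w, f, label):
    """One binary perceptron step: predict +1 if w . f > 0 and -1 otherwise;
    on a mistake, add label * f to the weights. Returns the updated weights."""
    activation = sum(wi * fi for wi, fi in zip(w, f))
    prediction = 1 if activation > 0 else -1
    if prediction == label:
        return w                     # corresponds to writing "No Change"
    return [wi + label * fi for wi, fi in zip(w, f)]

# Example usage mirroring the three rows of the table in part (c).
w = [1, 0, 0]
for f, label in [([1, 7, 8], -1), ([1, 6, 8], -1), ([1, 9, 6], +1)]:
    w = perceptron_update(w, f, label)
    print(w)
```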